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Abstract This paper deals with the problem of estimating the volume of the excursion set of a function 
/ : K d -> R above a given threshold, under a probability measure on R d that is assumed to be known. In 
the industrial world, this corresponds to the problem of estimating a probability of failure of a system. 
When only an expensive-to-simulate model of the system is available, the budget for simulations is 
usually severely limited and therefore classical Monte Carlo methods ought to be avoided. One of the 
main contributions of this article is to derive SUR (stepwise uncertainty reduction) strategies from a 
Bayesian-theoretic formulation of the problem of estimating a probability of failure. These sequential 
strategies use a Gaussian process model of / and aim at performing evaluations of / as efficiently as 
possible to infer the value of the probability of failure. We compare these strategies to other strategies 
also based on a Gaussian process model for estimating a probability of failure. 

Keywords Computer experiments • Sequential design ■ Gaussian processes • Probability of failure ■ 
Stepwise uncertainty reduction 



1 Introduction 

The design of a system or a technological product has to take into account the fact that some design 
parameters are subject to unknown variations that may affect the reliability of the system. In particular, 
it is important to estimate the probability of the system to work under abnormal or dangerous operating 
conditions due to random dispersions of its characteristic parameters. The probability of failure of a 
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system is usually expressed as the probability of the excursion set of a function above a fixed threshold. 
More precisely, let / be a measurable real function defined over a probability space (X, Z?(X), Px), with 
X C R d , and let u £ ft be a threshold. The problem to be considered in this paper is the estimation of 
the volume, under Px, of the excursion set 

r := {x £ X : f(x) > u} (1) 

of the function / above the level it. In the context of robust design, the volume a := Px(^) can be viewed 
as the probability of failure of a system: the probability Px models the uncertainty on the input vector 
x £ X of the system — the components of which are sometimes called design variables or factors — and 
/ is some deterministic performance function derived from the outputs of a deterministic model of the 
system 1 . The evaluation of the outputs of the model for a given set of input factors may involve complex 
and time-consuming computer simulations, which turns / into an expensive-to-evaluate function. When / 
is expensive to evaluate, the estimation of a must be carried out with a restricted number of evaluations 
of /, generally excluding the estimation of the probability of excursion by a Monte Carlo approach. 
Indeed, consider the empirical estimator 

1 m 

i=l 

where the JQs are independent random variables with distribution Px- According to the strong law of 
large numbers, the estimator a m converges to a almost surely when m increases. Moreover, it is an 
unbiased estimator of ct, i.e. E(a m ) = ol. Its mean square error is 

E((a m - a) 2 ) = — a(l - a) . 

If the probability of failure a is small, then the standard deviation of a m is approximately ^Ja/m. To 
achieve a given standard deviation 5a thus requires approximately l/(S 2 a) evaluations, which can be 
prohibitively high if a is small. By way of illustration, if a = 2 x 10 -3 and S = 0.1, we obtain m = 50000. 
If one evaluation of / takes, say, one minute, then the entire estimation procedure will take about 35 
days to complete. Of course, a host of refined random sampling methods have been proposed to improve 
over the basic Monte Carlo convergence rate; for instance, methods based on importance sampling with 
cross-entropy (Rubinstein and Kroese, 2004), subset sampling (Au and Beck, 2001) or line sampling 
(Pradlwarter et al., 2007). They will not be considered here for the sake of brevity and because the 
required number of function evaluations is still very high. 

Until recently, all the methods that do not require a large number of evaluations of / were based 
on the use of parametric approximations for either the function / itself or the boundary dF of r. 
The so-called response surface method falls in the first category (see, e.g., Bucher and Bourgund, 1990; 
Rajashekhar and Ellingwood, 1993, and references therein). The most popular approaches in the second 
category are the first- and second-order reliability method (FORM and SORM), which are based on 
a linear or quadratic approximation of dT around the most probable failure point (see, e.g., Bjerager, 
1990). In all these methods, the accuracy of the estimator depends on the actual shape of either / or dT 



1 Stochastic simulators are also of considerable practical interest, but raise specific modeling and computational 
issues that will not be considered in this paper. 
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and its resemblance to the approximant: they do not provide statistically consistent estimators of the 
probability of failure. 

This paper focuses on sequential sampling strategies based on Gaussian processes and kriging, which 
can been seen as a non-parametric approximation method. Several strategies of this kind have been 
proposed recently in the literature by Ranjan et al. (2008), Bichon et al. (2008), Picheny et al. (2010) 
and Echard et al. (2010a, b). The idea is that the Gaussian process model, which captures prior knowledge 
about the unknown function /, makes it possible to assess the uncertainty about the position of F given a 
set of evaluation results. This line of research has its roots in the field of design and analysis of computer 
experiments (see, e.g., Sacks et al., 1989; Currin et al., 1991; Welch et al., 1992; Oakley and O'Hagan, 
2002, 2004; Oakley, 2004; Bayarri et al., 2007). More specifically, kriging-based sequential strategies for 
the estimation of a probability of failure are closely related to the field of Bayesian global optimization 
(Mockus et al., 1978; Mockus, 1989; Jones et al., 1998; Villemonteix, 2008; Villemonteix et al., 2009; 
Ginsbourger, 2009). 

The contribution of this paper is twofold. First, we introduce a Bayesian decision-theoretic framework 
from which the theoretical form of an optimal strategy for the estimation of a probability of failure can be 
derived. One-step lookahead sub-optimal strategies are then proposed 2 , which are suitable for numerical 
evaluation and implementation on computers. These strategies will be called SUR (stepwise uncertainty 
reduction) strategies in reference to the work of D. Geman and its collaborators (see, e.g. Fleuret and 
Geman, 1999). Second, we provide a review in a unified framework of all the kriging-based strategies 
proposed so far in the literature and compare them numerically with the SUR strategies proposed in 
this paper. 

The outline of the paper is as follows. Section 2 introduces the Bayesian framework and recalls the 
basics of dynamic programming and Gaussian processes. Section 3 introduces SUR strategies, from the 
decision-theoretic underpinnings, down to the implementation level. Section 4 provides a review of other 
kriging-based strategies proposed in the literature. Section 5 provides some illustrations and reports 
an empirical comparison of these sampling criteria. Finally, Section 6 presents conclusions and offers 
perspectives for future work. 

2 Bayesian decision-theoretic framework 

2.1 Bayes risk and sequential strategies 

Let / be a continuous function. We shall assume that / corresponds to a computer program whose output 
is not a closed-form expression of the inputs. Our objective is to obtain a numerical approximation of 
the probability of failure 

a(f) = P x {x G X : f(x) > u} = / 1 f>u dP x , (3) 

Jx 

where 1 f >u stands for the characteristic function of the excursion set F, such that for any x G X, 
1f >u (x) equals one if x € r and zero otherwise. The approximation of a(f) has to be built from a set 
of computer experiments, where an experiment simply consists in choosing an x G X and computing 

2 Preliminary accounts of this work have been presented in Vazquez and Piera-Martincz (2007) and Vazquez 
and Beet (2009). 
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the value of / at x. The result of a pointwise evaluation of / carries information about / and quantities 
depending on / and, in particular, about !/>« and a(f). In the context of expensive computer exper- 
iments, we shall also suppose that the number of evaluations is limited. Thus, the estimation of a(f) 
must be carried out using a fixed number, say N, of evaluations of /. 

A sequential non-randomized algorithm to estimate a(f) with a budget of N evaluations is a 
pair {X N ,a N ), 

x N : f^X N (f) = (X 1 (f),X 2 (f ),..., X N (f))eJL N , S N :f^a N if)eR+, 
with the following properties: 

a) There exists x\ £ X such that Xi(f) = x\, i.e. X\ does not depend on /. 

b) Let Z n (f) = f(X n (f)), 1 < n < N. For all 1 < n < N, X n+1 (f) depends measurably 3 on l n (f), 
where I n = ((Xi.Zi) , . . . , (X n , Z n )). 

c) Siv(/) depends measurably onXjy(/)- 

The mapping X N will be referred to as a strategy, or policy, or design of experiments, and o/y will be 
called an estimator. The algorithm (Xn>®n) describes a sequence of decisions, made from an increasing 
amount of information: X\(f) = x\ is chosen prior to any evaluation; for each n = 1, . . . , N — 1, the 
algorithm uses information T n (f) to choose the next evaluation point X n +i(f); the estimation 3jv(/) 
of a(f) is the terminal decision. In some applications, the class of sequential algorithms must be further 
restricted: for instance, when K computer simulations can be run in parallel, algorithms that query 
batches of K evaluations at a time may be preferred (see, e.g. Ginsbourger et al., 2010). In this paper 
no such restriction is imposed. 

The choice of the estimator Sjv will be addressed in Section 2.4: for now, we simply assume that 
an estimator has been chosen, and focus on the problem of finding a good strategy X N ; that is, one 
that will produce a good final approximation apf(f) of a(f). Let .4^ be the class of all strategies Xn 
that query sequentially N evaluations of /. Given a loss function i : R x R -> R, we define the error 
of approximation of a strategy Xn -^-N on / as e (X N ,f) = L(aN(f),oe(f)). In this paper, we shall 
consider the quadratic loss function, so that e(X N ,f) = (Sjy(/) — a (/)) 2 - 

We adopt a Bayesian approach to this decision problem: the unknown function / is considered as 
a sample path of a real- valued random process £ defined on some probability space {fl,B, Prj) with 
parameter in x £ X, and a good strategy is a strategy that achieves, or gets close to, the Bayes risk 
r-B := infx N eA N ^0 ( e (2C/v>£))i where Erj denotes the expectation with respect to Prj. From a subjective 
Bayesian point of view, the stochastic model £ is a representation of our uncertain initial knowledge 
about /. From a more pragmatic perspective, the prior distribution can be seen as a tool to define a 
notion of a good strategy in an average sense. Another interesting route, not followed in this paper, would 
have been to consider the minimax risk i^x w eA N max / Eo (e(AjV)O) over some class of functions. 

Notations. From now on, we shall consider the stochastic model £ instead of the deterministic 
function / and, for abbreviation, the explicit dependence on £ will be dropped when no there is no 
risk of confusion; e.g., 2jy will denote the random variable 2tv(C)i X n will denote the random variable 
An(£), etc. We will use the notations J- n , Pn and E„ to denote respectively the cr-algebra generated 
by X n , the conditional distribution Po ( ■ | J- n ) and the conditional expectation Eo ( - | T n ). Note that 

3 i.e., there is a measurable map <p n : (X X R)" — > X such that X n = ip n o I n 
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the dependence of X n +i on T n can be rephrased by saying that is J^-measurable. Recall that 

E„ (Z) is J^-measurable, and thus can be seen as a measurable function of T n , for any random variable Z. 



2.2 Optimal and fc-step lookahead strategies 

It is well-known (see, e.g., Berry and Fristedt, 1985; Mockus, 1989; Bertsekas, 1995) that an optimal 
strategy for such a finite horizon problem 4 , i.e. a strategy Xn ^ A/V such that Erj (e(2CiVi£)) = r B> can 
be formally obtained by dynamic programming: let Rn = E^v (eQ£jvi£)) = E-n{(&N ~ a ) 2 ) denote the 
terminal risk and define by backward induction 

Rn = min E n (R n +i I X n +i = x ) > n = N - 1, . . . , 0. (4) 

To get an insight into (4), notice that R n +i, n = 0, ...,N — 1, depends measurably on I„+i = 
Jf n _)-i, Z n +i), so that E n (ii„_|_i | X n +\ — x) is in fact an expectation with respect to Z n +l, 
and R n is an J^-measurable random variable. Then, we have Rq = rg, and the strategy X* N defined by 

X n+ i = argmin E„ (i? n+ i | X n+ i = a;) , n = 1, . . . , N - 1, (5) 

is optimal 5 . It is crucial to observe here that, for this dynamic programming problem, both the space 
of possible actions and the space of possible outcomes at each step are continuous, and the state space 
(X x H) n at step n is of dimension n(d + 1). Any direct attempt at solving (4)-(5) numerically, over an 
horizon N of more than a few steps, will suffer from the curse of dimensionality. 
Using (4), the optimal strategy can be expanded as 

X* +1 = argmin E„ I min E n+ i . . . min E w _x R N X n+ i = x 

A very general approach to construct sub-optimal — but hopefully good — strategies is to truncate this 
expansion after k terms, replacing the exact risk R n +k by any available surrogate R n +k- Examples of 
such surrogates will be given in Sections 3 and 4. The resulting strategy, 



X n+ i = argmin E„ min E n+ i . . . min E n+k _ 1 R n+k X n+1 = x . (6) 

is called a k-step lookahead strategy (see, e.g., Bertsekas, 1995, Section 6.3). Note that both the opti- 
mal strategy (5) and the fc-step lookahead strategy implicitly define a sampling criterion J n (x), T n - 
measurable, the minimum of which indicates the next evaluation to be performed. For instance, in the 
case of the fc-step lookahead strategy, the sampling criterion is 

J n {x) = E„ I min E n+ i . . . min E n+fe _i R n+k X n+ i = x 



4 in other words, a sequential decision problem where the total number of steps to be performed is known from 
the start 

5 Proving rigorously that, for a given Pq and 2jvi equations (4) and (5) actually define a (measurable!) strategy 

£ Aff is technical problem that is not of primary interest in this paper. This can be done for instance, 
in the case of a Gaussian process with continuous covariance function (as considered later), by proving that 
x \— > E n (R n .)-i | X n _|_i(£) = x) is a continuous function on X and then using a measurable selection theorem. 
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In the rest of the paper, we restrict our attention to the class of one-step lookahead strategies, which 
is, as we shall see in Section 3, large enough to provide very efficient algorithms. We leave aside the 
interesting question of whether more complex fe-step lookahead strategies (with k > 2) could provide a 
significant improvement over the strategies examined in this paper. 

Remark 1 In practice, the analysis of a computer code usually begins with an exploratory phase, during 
which the output of the code is computed on a space-filling design of size no < N (see, e.g., Santner 
et al., 2003). Such an exploratory phase will be colloquially referred to as the initial design. Sequential 
strategies such as (5) and (6) are meant to be used after this initial design, at steps no + 1, . . . , N. An 
important (and largely open) question is the choice of the size no of the initial design, for a given global 
budget N. As a rule of thumb, some authors recommend to start with a sample size proportional to the 
dimension d of the input space X, for instance no = 10 d ; see Loeppky et al. (2009) and the references 
therein. 



2.3 Gaussian process priors 

Restricting £ to be a Gaussian process makes it possible to deal with the conditional distributions P n 
and conditional expectations E n that appear in the strategies above. The idea of modeling an unknown 
function / by a Gaussian process has originally been introduced circa 1960 in time series analysis (Parzen, 
1962), optimization theory (Kushner, 1964) and geostatistics (see, e.g., Chiles and Delfiner, 1999, and the 
references therein) . Today, the Gaussian process model plays a central role in the design and analysis of 
computer experiments (see, e.g., Sacks et al., 1989; Currin et al., 1991; Welch et al., 1992; Santner et al., 
2003). Recall that the distribution of a Gaussian process £ is uniquely determined by its mean function 
m(x) := Eo(5(x)), x £ X, and its covariance function k(x,y) := Eq ((£(x') — m(x))(t;(y) — m(y))), x,y £ 
X. Hereafter, we shall use the notation £ ~ GP (to, k) to say that £ is a Gaussian process with mean 
function m and covariance function k. 

Let £ ~ GP (0, k) be a zero-mean Gaussian process. The best linear unbiased predictor (BLUP) 
of £(:r) from observations £(xi), i = 1, . . . , n, also called the kriging predictor of £(x), is the orthogonal 
projection 

n 

t{x;x n ) := ^2\i(x;x n )£(xi) (7) 
?:=i 

of £(x) onto span{^(xi), i = l,...,n}. Here, the notation x n stands for the set of points x n = 
{xi, . . . ,x n }- The weights \i(x;x n ) are the solutions of a system of linear equations 

Kx n >Xn)K x ;Xn) = (8) 

where k(x n ,x n ) stands for the n x n covariance matrix of the observation vector, X(x;x n ) = 
(^l(x;x n ), . . . , X n (x;x n )) T , and k(x,x n ) is a vector with entries k{x,x{). The function x h-> £(x;x n ) 
conditioned on £(xi) = f{x\), . . . ,^(x n ) — f(x n ), is deterministic, and provides a cheap surrogate model 
for the true function / (see, e.g., Santner et al., 2003). The covariance function of the error of prediction, 
also called kriging covariance is given by 

Hx,y;x n ) := covU(x)-£(x;x n ),£(y)-£(y-,x n )\ 

= k(x,y)-Y]K(x;x n )k(y,Xi). (9) 
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The variance of the prediction error, also called the krigmg variance, is defined as a 2 (x; x n ) = k(x,x;x_ n ). 
One fundamental property of a zero-mean Gaussian process is the following (see, e.g., Chiles and Delfiner, 
1999, Chapter 3) : 

Proposition 1 If £ ~ GP (0, k), then the random process £ conditioned on the a-algebra T n generated 
by £ (xi), . . . , £(xn)> which we shall denote by £ | Tn, is a Gaussian process with mean £(•; x n ) given 
by (7)-(8) and covariance k ( • , • ; x n ) given by (9). In particular, £(x;x n ) = Eq^(x) | Tn) is the best 
Tn -measurable predictor of £(x), for all x £ X. 

In the domain of computer experiments, the mean of a Gaussian process is generally written as a 
linear parametric function 

m(.) = p T h(.), (10) 

where /3 is a vector of unknown parameters, and h — (hi, . . . , ft;) T is an /-dimensional vector of functions 
(in practice, polynomials). The simplest case is when the mean function is assumed to be an unknown 
constant m, in which case we can take /3 = m and h : x € X \-} 1. The covariance function is generally 
written as a translation-invariant function: 

k : (x, y) G X 2 i-> a 2 p g (x - y) , 

where a 2 is the variance of the (stationary) Gaussian process and pg is the correlation function, which 
generally depends on a parameter vector 9. When the mean is written under the form (10), the kriging 
predictor is again a linear combination of the observations, as in (7), and the weights Xi(x;x n ) are again 
solutions of a system of linear equations (see, e.g., Chiles and Delfiner, 1999), which can be written under 
a matrix form as 

) Kx n ) \ / Hx;x n ) \ K x ,x n ) 
h(x n ) J \ /i(x) J \ h(x) 

where h(x n ) is an I x n matrix with entries hi(xj), i = 1, . . . , I, j = 1, . . . , n, fx is a vector of Lagrange 
coefficients (k(x n ,x n ), \(x;x n ), k(x,x n ) as above). The kriging covariance function is given in this case 
by 

k(x,y,x n ) := covU(x)-^(x;x n ),^(y)-^(y;x n )] 

= H x ,y) - H x ;x n ) T k (y,x n ) - K x ) T Kv) ■ ( 12 ) 

The following result holds (Kimeldorf and Wahba, 1970; O'Hagan, 1978): 

Proposition 2 Let k be a covariance function. 

U\ m ~ GP(m, k) ,~ \ 

If { then i\ T n ~ GV{i(.;x n ), k(-,-;x n ) , 

[m: a; !-»• /3 A (as), f3 ~ W Ri v y 

where li^i stands for the (improper) uniform distribution over M} , and where £( • ;x n ) and fc( ■ , • ; a; n ) 
are given by (7), (11) ana" (12). 

Proposition 2 justifies the use of kriging in a Bayesian framework provided that the covariance function 
of £ is known. However, the covariance function is rarely assumed to be known in applications. Instead, 
the covariance function is generally taken in some parametric class (in this paper, we use the so-called 
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Matern covariance function, see Appendix A). A fully Bayesian approach also requires to choose a prior 
distribution for the unknown parameters of the covariance (see, e.g., Handcock and Stein, 1993; Kennedy 
and O'Hagan, 2001; Paulo, 2005). Sampling techniques (Monte Carlo Markov Chains, Sequential Monte 
Carlo...) are then generally used to approximate the posterior distribution of the unknown covariance 
parameters. Very often, the popular empirical Bayes approach is used instead, which consists in plugging- 
in the maximum likelihood (ML) estimate to approximate the posterior distribution of £. This approach 
has been used in previous papers about contour estimation or probability of failure estimation (Picheny 
et al., 2010; Ranjan et al., 2008; Bichon et al., 2008). In Section 5.2 we will adopt a plug-in approach as 
well. 

Simplified notations. In the rest of the paper, we shall use the following simplified notations when 
there is no risk of confusion: £ n (x) := £(x',Xn), o n (x) := a 2 (x; Xn)- 



2.4 Estimators of the probability of failure 



Given a random process £ and a strategy Xn, t ne optimal estimator that minimizes Erj ((a — S n ) 2 ) 
among all J^-measurable estimators a n , 1 < n < N, is 



a„ 



E n (a) = E„ Q^l £> „dPx) = ^p„dP X: 



where 



p n ■ x e X i-> P„ {£(x) > u} 



(13) 



(14) 



When f; is a Gaussian process, the probability p n (x) of exceeding u at x £ X given X n has a simple 
closed-form expression: 



p n (x) 



1 - $ 



U - £n{x) 
Vn(x) 



>I> 



£n(x) — U 
(Tn{x) 



(15) 



where $ is the cumulative distribution function of the normal distribution. Thus, in the Gaussian case, 
the estimator (13) is amenable to a numerical approximation, by integrating the excess probability p n 
over X (for instance using Monte Carlo sampling, see Section 3.3). 

Another natural way to obtain an estimator of ct given T n is to approximate the excess indicator 
lf>« by a hard classifier rj n : X — > {0, 1}, where "hard" refers to the fact that r\ n takes its values 
in {0, 1}. If r\ n is close in some sense to 1^ >u , the estimator 



in = / r) n 



dP, 



should be close to a. More precisely, 



-.„ ((an - a) 2 ) = E n ^ (nn ~ lf>„)dP x ^ < j E n ((77„ - t (>u f) dP 5 



(16) 



(17) 



Let T n {x) = Pn{r)n{x) / l^( x )>„} = E ra ((rjn{x) — l^( x ) >u ) 2 ) be the probability of misclassification; 
that is, the probability to predict a point above (resp. under) the threshold when the true value is under 
(resp. above) the threshold. Thus, (17) shows that it is desirable to use a classifier rj n such that r n is 
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small for all x £ X. For instance, the method called SMART (Deheeger and Lemaire, 2007) uses a support 
vector machine to build r\ n - Note that 

T n(x) = Pn(x) + (l-2pn(x))Tln(x). 

Therefore, the right-hand side of (17) is minimized if we set 

T] n (x) = lp n ( a .)>i/ 2 = 1 i n (x)>u . ( 18 ) 
where £n(x) denotes the posterior median of £(x). Then, we have 

T n (x) = mm(p n (x), 1 -p n (x)). 

In the case of a Gaussian process, the posterior median and the posterior mean are equal. Then, the 
classifier that minimizes T n (x) for each x € X is r\ n = >u , in which case 

Mx) = Pn ((£(*) - u)(&(x) - u) < 0) = 1 - * ^^j~ M ^ • (19) 

Notice that for r\ n = Ij- >u , we have S n = a(£n). Therefore, this approach to obtain an estimator of a 
can be seen as a type of plug-in estimation. 

Standing assumption. It will assumed in the rest of the paper that £ is a Gaussian process, or more 
generally that £ | T n ~ GP(£ n , k( • , • ; £„)) for all n > 1 as in Proposition 2. 



3 Stepwise uncertainty reduction 

3.1 Principle 

A very natural and straightforward way of building a one-step lookahead strategy is to select greedily each 
evaluation as if it were the last one. This kind of strategy, sometimes called myopic, has been successfully 
applied in the field of Bayesian global optimization (Mockus et al., 1978; Mockus, 1989), yielding the 
famous expected improvement criterion later popularized in the Efficient Global Optimization (EGO) 
algorithm of Jones et al. (1998). 

When the Bayesian risk provides a measure of the estimation error or uncertainty (as in the present 
case), we call such a strategy a stepwise uncertainty reduction (SUR) strategy. In the field of global 
optimization, the Informational Approach to Global Optimization (IAGO) of Villemonteix et al. (2009) 
is an example of a SUR strategy, where the Shannon entropy of the minimizer is used instead of the 
quadratic cost. When considered in terms of utility rather than cost, such strategies have also been called 
knowledge gradient policies by Frazier et al. (2008). 

Given a sequence of estimators (Sn) n>1 , a direct application of the above principle using the 
quadratic loss function yields the sampling criterion 

J n {x) = E n ((a - 3„+i) 2 I X n+1 = . (20) 

Having found no closed-form expression for this criterion, and no efficient numerical procedure for its 
approximation, we will proceed by upper-bounding and discretizing (20) in order to get an expression 
that will lend itself to a numerically tractable approximation. By doing so, several SUR strategies will 
be derived, depending on the choice of estimator (the posterior mean (13) or the plug-in estimator (16) 
with (18)) and bounding technique. 
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3.2 Upper bounds of the SUR sampling criterion 

Recall that T n {x) = rmn(p n (x), 1 — p n (x)) is the probability of misclassification at x using the optimal 
classifier lg- ( x \ >u - Let us further denote by v n (x) := p n (x) (1 — Pn(x)) the variance of the excess 
indicator l^( x )> u - 



Proposition 3 Assume that either a n = E„ (a) or a n — J Ij- >t( dPx- Define G n '■= J x \/ Jn(y)dPj 
for all n £ {0, . . . , N — 1}, with 



In 



!Vn = Pn{l — Pn) = T„(l — T„) , if OL n = E n (a) , 
T n = min(p„ , 1 - p n ) , ifa n =J 1 ^ > u 

Then, for all x £ X and aZZ n G {0, . . . , JV — 1}, 

Jn(x) < J n {x) := E„ (C n +1 | =xj . 

Note that 'yn(x) is a function of p n {x) that vanishes at and 1, and reaches its maximum at 1/2; that 
is, when the uncertainty on 1? , K is maximal (see Figure 1). 

Proof First, observe that, for all n > 0, a — S„ = /L/„dP x , with 

rr rr ,\ \ 1 e(x)>u ~ Pn(x) if 5„ = E„ (a) , 

{/„ : x G X M> U n {x) = < (21) 
[ - 1 if = / 1 f„>n dR X ■ 

Moreover, note that 7« = ||fn|| n in both cases, where || • ||„ : L 2 (Q,B,P) -» £ 2 («,.F„,P), W H> 
En(W 2 ) 1 ^ 2 . Then, using the generalized Minkowski inequality (see, e.g., Vestrup, 2003, section 10.7) we 
get that 

|/[/ n dPx| n < J\\U n \\ n 4Px = J v^dPx = G n . (22) 
Finally, it follows from the tower property of conditional expectations and (22) that, for all n > 0, 

Jn{x) = E n (\\a - 3„ + i|| 2 l+1 | X n +1 = x) 
= E n (J|/ U n +1 dPx||^ +1 I X n +1 = 35 

< E n (G 2 +1 I X„ + i = ar) . 



□ 

Note that two other upper-bounding sampling criteria readily follow from those of Proposition 3, by 
using the Cauchy-Schwarz inequality in I? (X,6(X), Px): 



Jn(x) < E n U 7„+l dP X | X n +1 = x \ . (23) 

As a result, we can write four SUR criteria, whose expressions are summarized in Table 1. Criterion 
J\ V n nas been proposed in the PhD thesis of Piera-Martinez (2008) and in conference papers (Vazquez 
and Piera-Martinez, 2007; Vazquez and Beet, 2009); the other ones, to the best of our knowledge, 
are new. Each criterion is expressed as the conditional expectation of some (possibly squared) J- n +\- 
measurable integral criterion, with an integrand that can be expressed as a function of the probability 
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0.2 0.4 0.6 0.8 1 

Pn 



Fig. 1 7 n as a function of p n (see Proposition 3). In both cases, 7 n is maximum at p n = 1/2. 

of misclassification T n +\. It is interesting to note that the integral in J| UR is the integrated mean square 
error (IMSE) 6 for the process t^ >u . 

Remark 2 The conclusions of Proposition 3 still hold in the general case when £ is not assumed to be a 
Gaussian process, provided that the posterior median £ n is substituted to posterior the mean £„. 

Table 1 Expressions of four SUR-type criteria. 



SUR-typc sampling criterion 


How it is obtained 


jf u n R (x) = E„ 


((/v^TTdPx) 2 


X, 


l+l 


=-) 


Prop. 3 with a n — 


JHn>u AP * 


= En 


((/ V^TTdPx) 2 


X, 


l+l 


=•) 


Prop. 3 with a n — 


E„ (a) 


jITw = E " 


(|r„ +1 dP x 


X, 


l+l 


=•) 


Eq. (23) with S n = 


; / 1 5„> u dPx 


J*™( x ) = E„ 




X, 


l+l 


=») 


Eq. (23) with 5„ = 


= E„ (a) 



3.3 Discretizations 

In this section, we proceed with the necessary integral discretizations of the SUR criteria to make them 
suitable for numerical evaluation and implementation on computers. Assume that n steps of the algorithm 
have already been performed and consider, for instance, the criterion 

Jf™(x) = E n (/ r n+1 (y) P x (dy) | X n+1 = xj . (24) 

Remember that, for each y £ X, the probability of misclassification r„_)_i(t/) is J- n +l -measurable and, 
therefore, is a function of I n +i = {X n ,X n ^.\,Z n ^ r {). Since T n is known at this point, we introduce the 

6 The IMSE criterion is usually applied to the response surface £ itself (see, e.g., Box and Draper, 1987; Sacks 
et al., 1989). The originality here is to consider the IMSE of the process lj> ll instead. Another way of adapting 
the IMSE criterion for the estimation of a probability of failure, proposed by Picheny et al. (2010), is recalled in 
Section 4.2. 
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notation v n +i(y; X n +i, Z n +\) = T n +i(y) to emphasize the fact that, when a new evaluation point must 
be chosen at step (n + 1), T n +i(y) depends on the choice of X n +i and the random outcome Z n -\-i. Let 
us further denote by Q n ,x the probability distribution of £(a;) under P n . Then, (24) can be rewritten as 

J^n(x) =11 v n+1 (y; x, z) Q„, x (dz) P x (dy) , 

J JRxX 

and the corresponding strategy is: 

X n+ i = argmin // v n+ i(y; x, z) Q n ,x(dz) Px(dy) . (25) 
iex Jjexx 

Given I n and a triple (x, y, z) , u n +i (y; x, z) can be computed efficiently using the equations provided in 
Sections 2.3 and 2.4. 

At this point, we need to address: 1) the computation of the integral on X with respect to Px; 2) 
the computation of the integral on R with respect to Q n ,x', 3) the minimization of the resulting criterion 
with respect to a; £ X. 

To solve the first problem, we draw an i.i.d. sequence Y\, . . . , Y m ~ Px an d use the Monte Carlo 
approximation: 

f 1 ™ 

/ v n+ i(y;x,z) Px(dj/) ~ — Y] v n+ i(Yj; x, z). 
Jjl 171 j=1 

An increasing sample size n h-> m n should be used to build a convergent algorithm for the estimation 
of a (possibly with a different sequence Y n i, . . . , Yn,m„ at each step). In this paper we adopt a different 
approach instead, which is to take a fixed sample size m > and keep the same sample Y\ , . . . , Y m 
throughout the iterations. Equivalently, it means that we choose to work from the start on a discretized 
version of the problem: we replace Px by the empirical distribution Px,n = ^ 2j=l &Yj> an< ^ our 8 oa l 
is now to estimate the Monte Carlo estimator a m = / l^>„dPx,n = m EJLi lf(Vj)>u! using either 
the posterior mean E„ (a m ) = ^,^ZjVn{Yj) or the plug-in estimate — ^ry-X )>«' This kind of 
approach has be coined meta- estimation by Arnaud et al. (2010): the objective is to estimate the value 
of a precise Monte Carlo estimator of a(f) (m being large), using prior information on / to alleviate the 
computational burden of running m times the computer code /. This point of view also underlies the 
work in structural reliability of Hurtado (2004, 2007), Deheeger and Lemaire (2007), Deheeger (2008), 
and more recently Echard et al. (2010a,b). 

The new point of view also suggests a natural solution for the third problem, which is to replace 
the continuous search for a minimizer x £ X by a discrete search over the set X rn := {Y\,... ,Y m }- 
This is obviously sub-optimal, even in the meta-estimation framework introduced above, since picking 
x € X\X rn can sometimes bring more information about £(Yi), . . . , (,(Y m ) than the best possible choice 
in X m . Global optimization algorithms may of course be used to tackle directly the continuous search 
problem: for instance, Ranjan et al. (2008) use a combination of a genetic algorithm and local search 
technique, Bichon et al. (2008) use the DIRECT algorithm and Picheny et al. (2010) use a covariance- 
matrix-adaptation evolution strategy. In this paper we will stick to the discrete search approach, since 
it is much simpler to implement (we shall present in Section 3.4 a method to handle the case of large m) 
and provides satisfactory results (see Section 5). 

Finally, remark that the second problem boils down to the computation of a one-dimensional integral 
with respect to Lebesgue's measure. Indeed, since £ is a Gaussian process, Q n ,x is a Gaussian probability 
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distribution with mean £n(x) and variance <Jn(x) as explained in Section 2.3. The integral can be com- 
puted using a standard Gauss-Hermite quadrature with Q points (see, e.g., Press et al., 1992, Chapter 4) 

v n +l(y;x,z)Q n ,x(dz) « —j= ^ w q v n+ i (y; x, £ n (x) + a n (x)uqV2) , 

J q=l 

where Mi,... ,uq denote the quadrature points and w\,... ,wq the corresponding weights. Note that 
this is equivalent to replacing under P n the random variable £(ar) by a quantized random variable with 
probability distribution Yl^=i w 'q^z n+l q {x)-> where w' q = w q / '^pK and z n+ i. q (x) — £ n (x) + o n {x)u q y/2. 
Taking all three discretizations into account, the proposed strategy is: 

m Q 

X n +1 = argmin ^ ^ w' q v n+ i (Yj; Y k , z n+ i_ q (Y k )) . (26) 

l<k<m j=1 g=1 



3.4 Implementation 

This section gives implementation guidelines for the SUR strategies described in Section 3. As said in 
Section 3.3, the strategy (26) can, in principle, be translated directly into a computer program. In practice 
however, we feel that there is still room for different implementations. In particular, it is important to 
keep the computational complexity of the strategies at a reasonable level. We shall explain in this section 
some simplifications we have made to achieve this goal. 

A straight implementation of (26) for the choice of an additional evaluation point is described in 
Table 2. This procedure is meant to be called iteratively in a sequential algorithm, such as that described 
for instance in Table 3. Note that the only parameter to be specified in the SUR strategy (26) is Q, which 
tunes the precision of the approximation of the integral on R with respect to Q n ,x- In our numerical 
experiments, it was observed that taking Q = 12 achieves a good compromise between precision and 
numerical complexity. 

To assess the complexity of a SUR sampling strategy, recall that kriging takes 0(mn 2 ) operations 
to predict the value of / at m locations from n evaluation results of / (we suppose that m > n and 
no approximation is carried out). In the procedure to select an evaluation, a first kriging prediction is 
performed at Step 1 and then, m different predictions have to performed at step 2.1. This cost becomes 
rapidly burdensome for large values of n and m, and we must further simplify (26) to be able to work on 
applications where m must be large. A natural idea to alleviate the computational cost of the strategy 
is to avoid dealing with candidate points that have a very low probability of misclassification, since they 
are probably far from the frontier of the domain of failure. It is also likely that those points with a 
low probability of misclassification will have a very small contribution in the variance of the error of 
estimation a n — a m ■ 

Therefore, the idea is to rewrite the sampling strategy described by (26), in such a way that the 
first summation (over m) and the search set for the minimizer is restricted to a subset of points Yj 
corresponding to the mo largest values of T n (Yj). The corresponding algorithm is not described here for 
the sake of brevity but can easily be adapted from that of Table 2. Sections 5.2 and 5.3 will show that 
this pruning scheme has almost no consequence on the performances of the SUR strategies, even when 
one considers small values for tuq (for instance mj = 200). 
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Table 2 Procedure to select a new evaluation point X n +i £ X using a SUR strategy 
Require computer representations of 

a) a set I n = {(Xi, f(Xi)), . . . , (X n , f{X n ))} of evaluation results; 

b) a Gaussian process prior £ with a (possibly unknown linear parametric) mean function and a covariancc 
function kg, with parameter 9; 

c) a (pscudo-)random sample X m = {Yl, . . . , Y m } of size m drawn from the distribution Px; 

d) quadrature points Mi,... ,uq and corresponding weights w' 1} . . . ,w'q\ 
c) a threshold u. 

1. compute the kriging approximation /„ and kriging variance a n on X m from I n 

2. for each candidate point Yj, j £ {1, . . . , m}, 

2.1 for each point Yj., k S {1, . . . , m}, compute the kriging weights A; (Yj. ; {X n , Y/}),iS{l,...,(n + 1)}, and 
the kriging variances <t 2 (Yj.; {Xn, Yj}) 

2.2 compute z n +l,q(Yj) = fn{Yj) + a n iYj)u q y/2, for q = 1, . . . , Q 

2.3 for each z n +l,q{Yj), q £{!,■■■ , Q}, 

2.3.1 compute the kriging approximation f n +l,j,q on X m from I n U (Yj,/(Yj) = z n +l,q(Yj)), using the 
weights Ai(Yfe; {Xn, Yj}), i = 1, , (n + 1), k = 1, . . . , m, obtained at Step 2.1. 

2.3.2 for each k S {1, . . . , m}, compute v n +i (Y^; Yj, z n +l,q(Yj)), using u, f n +ljq obtained in 2.3.1, and 
v 2 (Xk\{Xn, Yj}) obtained in 2.1 

2.4 compute J n (Y 3 ) = £™=1 E?=i < «n+l ^. 2n+l, g (Yj)). 

3. find 3* = argmin^ J n (Y, ) and set X n +± = Yj* 



Table 3 Sequential estimation of a probability of failure 



1. Construct an initial design of size no < N and evaluate / at the points of the initial design. 

2. Choose a Gaussian process £ (in practice, this amounts to choosing a parametric form for the mean of £ and 

a parametric covariance function kg) 

3. Generate a Monte Carlo sample X m = {Yl, . . . , Y m } of size m from Px 

4. While the evaluation budget N is not exhausted, 

4.1 optional step: estimate the parameters of the covariance function (case of a plug-in approach); 

4.2 select a new evaluation point, using past evaluation results, the prior § and X m ; 

4.3 perform the new evaluation. 

5. Estimate the probability of failure obtained from the N evaluations of / (for instance, by using Ejv (ot m ) = 

^PNiYj)). 



4 Other strategies proposed in the literature 

4.1 Estimation of a probability of failure and closely related objectives 

Given a real function / denned over X C R d , and a threshold u 6 R, consider the following possible 
goals: 

1. estimate a region fclof the form f = {i 6 X f(x) > u}; 

2. estimate the level set dr = {x G X f(x) = u}; 

3. estimate / precisely in a neighborhood of 8T; 
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4. estimate the probability of failure a = Px(-T) f° r a given probability measure Px- 

These different goals are, in fact, closely related: indeed, they all require, more or less explicitly, to select 
sampling points in order to get a fine knowledge of the function / in a neighborhood of the level set OF 
(the location of which is unknown before the first evaluation). Any strategy proposed for one of the first 
three objectives is therefore expected to perform reasonably well on the fourth one, which is the topic 
of this paper. 

Several strategies recently introduced in the literature are presented in Sections 4.2 and 4.3, and will 
be compared numerically to the SUR strategy in Section 5. Each of these strategies has been initially 
proposed by its authors to address one or several of the above objectives, but they will only be discussed 
in this paper from the point of view of their performance on the fourth one. Of course, a comparison 
focused on any other objective would probably be based on different performance metrics, and thus could 
yield a different performance ranking of the strategies. 



4.2 The targeted IMSE criterion 

The targeted IMSE proposed in Picheny et al. (2010) is a modification of the IMSE (Integrated Mean 
Square Error) sampling criterion (Sacks et al., 1989). While the IMSE sampling criterion computes the 
average of the kriging variance (over a compact domain X) in order to achieve a space-filling design, 
the targeted IMSE computes a weighted average of the kriging variance for a better exploration of the 
regions near the frontier of the domain of failure, as in Oakley (2004). The idea is to put a large weight 
in regions where the kriging prediction is close to the threshold u, and a small one otherwise. Given X n , 
the targeted IMSE sampling criterion, hereafter abbreviated as tIMSE, can be written as 

J n 1MSE (x) = E„ (Jj$ - W„dP x | X n+1 = x^j (27) 

a 2 (y;X 1 ,...,X n ,x) W n (y) P x (dy), (28) 

where W n is a weight function based on X n . The weight function suggested by Picheny et al. (2010) is 

w *'-=^b-'(-K% 2 )')- 

where s n (x) = <xf + o n (x). Note that W n (x) is large when £ n (x) ~ u and o n (x) fa 0, i.e., when the 
function is known to be close to u. 

The tIMSE criterion operates a trade-off between global uncertainty reduction (high kriging variance 
a n ) and exploration of target regions (high weight function Wn)- The weight function depends on a 
parameter a £ > 0, which allows to tune the width of the "window of interest" around the threshold. 
For large values of a £ , J tIMSE behaves approximately like the IMSE sampling criterion. The choice of an 
appropriate value for a e , when the goal is to estimate a probability of failure, will be discussed on the 
basis of numerical experiments in Section 5.3. 

The tIMSE strategy requires a computation of the expectation with respect to £(x) in (27), which 
can be done analytically, yielding (28). The computation of the integral with respect to Px on X can be 
carried out with a Monte Carlo approach, as explained in Section 3.3. Finally, the optimization of the 
criterion is replaced by a discrete search in our implementation. 
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4.3 Criteria based on the marginal distributions 

Other sampling criteria proposed by Ranjan et al. (2008), Bichon et al. (2008) and Echard et al. (2010a, b) 
are briefly reviewed in this section 7 . A common feature of these three criteria is that, unlike the SUR and 
tIMSE criteria discussed so far, they only depend on the marginal posterior distribution at the considered 
candidate point x £ X, which is a Gaussian Af(£, n (x), cr^(a;)) distribution. As a consequence, they are 
of course much cheaper to compute than integral criteria like SUR and tIMSE. 

A natural idea, in order to sequentially improve the estimation of the probability of failure, is to visit 
the point x € X where the event {£(2) > u} is the most uncertain. This idea, which has been explored 
by Echard, Gayton, and Lemaire (2010a,b), corresponds formally to the sampling criterion 

J^(x) = r n (x) = 1 - $ ( hzlppi ) . (30) 

As in the case of the tIMSE criterion and also, less explicitly, in SUR criteria, a trade-off is realized 
between global uncertainty reduction (choosing points with a high a n (x)) and exploration of the neigh- 
borhood of the estimated contour (where \u — £71(2;) j is small). 

The same leading principle motivates the criteria proposed by Ranjan et al. (2008) and Bichon et al. 
(2008), which can be seen as special cases of the following sampling criterion: 

J™(x) := E„(max(0, e (^-|u-^(x)| 5 )), (31) 

where e(x) = Ka n {x), k, S > 0. The following proposition provides some insights into this sampling 
criterion: 

Proposition 4 Define G K s : ]0, 1[ — > K+ by 

G KtS (p) := E( maX (0,K 5 - l*- 1 ^) + t/|)) , 

where U is a Gaussian A/"(0, 1) random variable. Let <p and <2> denote respectively the probability density 
function and the cumulative distribution function of U . 

a) G„, ff (p) = G K>S (1 - p) for all p G ]0, 1[. 

b) G K $ is strictly increasing on ]0, 1/2] and vanishes at 0. Therefore, G K $ is also strictly decreasing 
on [1/2, 1[, vanishes at 1, and has a unique maximum at p = 1/2. 

c) Criterion (31) can be rewritten as 

J% B (x) = a n (x) 6 G K! s(p n (x)). (32) 

d) G K 1 has the following closed-form expression: 

G K ,i( P ) = «(#(«+)-*(*-)) 

- f (2<5(t)-<P(f + )-0(O) ( 33 ) 

- (2<p(t)-<p(t + )-<p(t-)), 

where t = (1 — p), t + = t + k and t~ = t — k. 



7 Note that the paper of Ranjan et al. (2008) is the only one in this category that does not address the problem 
of estimating a probability of failure (i.e., Objective 4 of Section 4.1). 
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e ) G K: 2 has the following closed-form expression: 

G K , 2 (p) = ^ 2 -l-t 2 ) (*(t+) -*(*-)) 

- 2t (v(t + ) - ^(O) (34) 
+ t + ip{t + ) -t~ip{t~), 

with the same notations. 

It follows from a) and c) that Jn B (x) can also be seen as a function of the kriging variance cr^(x) 
and the probability of misclassification r n (x) = min(p n (2;), 1 — p n (x)). Note that, in the computation 
of G K .$(pn(x)) , the quantity denoted by t in (33) and (34) is equal to (u — £n(x)) /o~n(x), i.e., equal to 
the normalized distance between the predicted value and the threshold. 

Bichon et al.'s expected feasibility function corresponds to (32) with 5 = 1, and can be computed 
efficiently using (33). Similarly, Ranjan et al.'s expected improvement® function corresponds to (32) 
with 8 = 2, and can be computed efficiently using (34). The proof of Proposition 4 is provided in 
Appendix B. 

Remark 3 In the case 5 = 1, our result coincides with the expression given by Bichon et al. (2008, 
Eq. (17)). In the case 8 — 2, we have found and corrected a mistake in the computations of Ranjan et al. 
(2008, Eq. (8) and Appendix B). 



5 Numerical experiments 

5.1 A one-dimensional illustration of a SUR strategy 

The objective of this section is to show the progress of a SUR strategy in a simple one-dimensional case. 
We wish to estimate a = Px{/ > 1}, where / : X = E, — ► R is such that Va; G R, 

f(x) = (0.4s - 0.3) 2 + exp (-11.534 M 1 ' 95 ) + exp(-5(x - 0.8) 2 ) , 

and where X is endowed with the probability distribution Px = Af(0,a^}, ax = 0.4, as depicted in 
Figure 2. We know in advance that a « 0.2. Thus, a Monte Carlo sample of size m = 1500 will give a 
good estimate of a. 

In this illustration, £ is a Gaussian process with constant but unknown mean and a Matern covariance 
function, whose parameters are kept fixed, for the sake of simplicity. Figure 2 shows an initial design of 
four points and the sampling criterion J^„^_ 4 . Notice that the sampling criterion is only computed at 
the points of the Monte Carlo sample. Figures 3 and 4 show the progress of the SUR strategy after a 
few iterations. Observe that the unknown function / is sampled so that the probability of excursion p n 
almost equals zero or one in the region where the density of Px is high. 

8 Despite its name and some similarity between the formulas, this criterion should not be confused with the 
well-known EI criterion in the field of optimization (Mockus et al., 1978; Jones et al., 1998). 
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5.2 An example in structural reliability 

In this section, we evaluate all criteria discussed in Section 3 and Section 4 through a classical benchmark 
example in structural reliability (see, e.g., Borri and Speranzini, 1997; Waarts, 2000; Schueremans, 2001; 
Deheeger, 2008). Echard et al. (2010a,b) used this benchmark to make a comparison among several 
methods proposed in Schueremans and Gemert (2005), some of which are based on the construction of 
a response surface. The objective of the benchmark is to estimate the probability of failure of a so-called 
four-branch series system. A failure happens when the system is working under the threshold u = 0. The 
performance function / for this system is defined as 



The uncertain input factors are supposed to be independent and have standard normal distribution. 
Figure 5 shows the performance function, the failure domain and the input distribution. Observe that / 
has a first-derivative discontinuity along four straight lines originating from the point (0,0). 

For each sequential method, we will follow the procedure described in Table 3. We generate an initial 
design of no = 10 points (five times the dimension of the factor space) using a maximin LHS (Latin 
Hypercube Sampling) 9 on [—6; 6] x [—6; 6]. We choose a Monte Carlo sample of size m = 30000. Since 
the true probability of failure is approximately a = 0.4% in this example, the coefficient of variation 
for a m is \j\frna « 9%. The same initial design and Monte Carlo sample are used for all methods. 

A Gaussian process with constant unknown mean and a Matern covariance function is used as our 
prior information about /. The parameters of the Matern covariance functions are estimated on the 
initial design by REML (see, e.g. Stein, 1999). In this experiment, we follow the common practice of 
re-estimating the parameters of the covariance function during the sequential strategy, but only once 
every ten iterations to save some computation time. 

The probability of failure is estimated by (13). To evaluate the rate of convergence, we compute the 
number n 7 of iterations that must be performed using a given strategy to observe a stabilization of the 
relative error of estimation within an interval of length 27: 



All the available sequential strategies are run 100 times, with different initial designs and Monte Carlo 
samples. The results for 7 = 0.10, 7 = 0.03 and 7 = 0.01 are summarized in Table 4. We shall consider 
that 710.1 provides a measure of the performance of the strategy in the "initial phase", where a rough 
estimate of a is to be found, whereas no. 03 and no. 01 measure the performance in the "refinement phase". 

The four variants of the SUR strategy (see Table 1) have been run with Q = 12 and either tuq — 10 
or mo = 500. The performance are similar for all four variants and for both values of rriQ. It appears, 
however, that the criterions Jf UR and Jf UR 2 (i.e., the criterions given directly by Proposition 3) are 
slightly better than Jf UR and Jf UR ; this will be confirmed by the simulations of Section 5.3. It also 

9 More precisely, we use Matlab's lhsdesignO function to select the best design according to the maximin 
criterion among 10 4 randomly generated LHS designs. 



/ : (sci, X2) € R n> f(x\, X2) = min < 



' 3 + 0.1(a:i - x 2 ) 2 - (xi + x 2 )/V2; ' 
3 + 0.1(a:i - x 2 ) 2 + (X! + x 2 )/V2; 
(xi-x 2 ) + 6/V2; 
K (x 2 -x 1 ) + 6/V2 
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seems that the SUR algorithm is slightly slower to obtain a rough estimate of the probability of failure 
when mo is very small, but performs very well in the refinement phase. (Note that mo = 10 is a drastic 
pruning for a sample of size m = 30000.) 

The tIMSE strategy has been run for three different values of its tuning parameter erf, using the 
pruning scheme with mo = 500. The best performance is obtained for erf ~ 0, and is almost as good 
as the performance of SUR stragies with the same value of mO (a small loss of performance, of about 
one evaluation on average, can be noticed in the refinement phase). Note that the required accuracy 
was not reached after 200 iterations in 17% of the runs for erf = 1. In fact, the tIMSE strategy tends 
to behave like a space-filling strategy in this case. Figure 6 shows the points that have been evaluated 
in three cases: the evaluations are less concentrated on the boundary between the safe and the failure 
region when erf = 1. 

Finally, the results obtained for J RB and J EGL indicate that the corresponding strategies are clearly 
less efficient in the "initial phase" than strategies based on Jf UR or Jf UR - For 7 = 0.1, the average 
loss with respect to Jf UR is between approximately 0.9 evaluations for the best case (criterion J RB with 
8 = 2, k = 2) and 3.9 evaluations for the worst case. For 7 = 0.03, the loss is between 1.4 evaluations 
(also for (criterion J RB with 8 = 2, k = 2) and 3.5 evaluations. This loss of efficiency can also be observed 
very clearly on the 90 th percentile in the inital phase. Criterion J RB seems to perform best with 8 — 2 
and k = 2 in this experiment, but this will not be confirmed by the simulations of Section 5.3. Tuning 
the parameters of this criterion for the estimation of a probability of failure does not seem to be an easy 
task. 



Table 4 Comparison of the convergence to a m in the benchmark example Section 5.2 for different sampling 
strategies. The first number (bold text) is the average value of n 7 over 100 runs. The numbers between brackets 
indicate the 10 th and 90 th percentile. 



criterion 


parameters 


7 = 0.10 


7 = 0.03 


7 = 0.01 


rSUR 
J l 


m = 500 
mo = 10 


16.1 [10-22] 
19.4 [11-28] 


25.7 [17-35] 
28.1 [19-38] 


36.0 [26-48] 
35.4 [26-44] 


rSUR 
J 2 


m = 500 
mo = 10 


16.4 [10-24] 
20.0 [11-30] 


25.7 [19-33] 
28.3 [20-39] 


35.5 [25-45] 
35.3 [26-44] 




m = 500 
mo = 10 


18.2 [10-27] 
20.1 [11-30] 


26.9 [18-37] 
28.0 [20-36] 


35.9 [27-46] 
35.2 [25-44] 


jsur 


m = 500 
mo = 10 


17.2 [10-28] 
21.4 [13-30] 


26.5 [20-36] 
28.9 [20-38] 


35.2 [25-45] 
35.5 [27-44] 


jtlMSE 


a 2 s = lO" 6 
erf = 0.1 
-1 = 1 


16.6 [10-23] 
15.9 [10-22] 

21.7 [11-31] 


26.5 [19-36] 
29.1 [19-43] 
52.4 [31-85] 


37.3 [28-49] 
50.5 [30-79] 
79.5 [42-133] M 


JEGL 




21.0 [11-31] 


29.2 [21-39] 


36.4 [28-44] 


JRB 


5 = 1, k = 0.5 
5 = 1, K = 2.0 
5 = 2, K = 0.5 
S = 2, K = 2.0 


18.7 [10-27] 
18.9 [11-28] 
17.6 [10-24] 
17.0 [10-21] 


27.5 [20-35] 
28.3 [21-35] 

27.6 [20-34] 
27.1 [20-34] 


36.6 [27-44] 

37.7 [30-45] 
37.1 [29-45] 

36.8 [29-44] 



(*) The required accuracy was not reached after 200 iterations in 17% of the runs 
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Table 5 Size of the initial design and covariancc parameters for the experiments of Section 5.3. The parametriza- 
tion of the Matern covariance function used here is defined in Appendix A. 



d 


no 


a 2 


V 


P 


1 


3 


1.0 


2.0 


0.100 


2 


10 


1.0 


2.0 


0.252 


3 


15 


1.0 


2.0 


0.363 



5.3 Average performance on sample paths of a Gaussian process 

This section provides a comparison of all the criteria introduced or recalled in this paper, on the basis of 
their average performance on the sample paths of a zero-mean Gaussian process defined on X = [0, l] d , 
for d £ {1, 2, 3}. In all experiments, the same covariance function is used for the generation of the sample 
paths and for the computation of the sampling criteria. We have considered isotropic Matern covariance 
functions, whose parameters are given in Table 5. An initial maximin LHS design of size no (also given 
in the table) is used: note that the value of n reported on the x-axis of Figures 7-11 is the total number 
of evaluations, including the initial design. 

The d input variables are assumed to be independent and uniformly distributed on [0, 1], i.e., Px is 
the uniform distribution on X. An m-sample Y\, . . . , Y m from Px is drawn one and for all, and used 
both for the approximation of integrals (in SUR and tIMSE criteria) and for the discrete search of the 
next sampling point (for all criteria). We take m = 500 and use the same MC sample for all criteria in 
a given dimension d. 

We adopt the meta-estimation framework as described in Section 3.3; in other words, our goal is to 
estimate the MC estimator a m . We choose to adjust the threshold u in order to have a m = 0.02 for all 
sample paths (note that, as a consequence, there are exactly ma m = 10 points in the failure region) and 
we measure the performance of a strategy after n evaluations by its relative mean-square error (MSE) 
expressed in decibels (dB): 



rMSE := 




where a.m,n = m£j=lPn (Yj) is the posterior mean of the MC estimator a m after n evaluations on 
the Z th simulated sample path (L = 4000). 

We use a sequential maximin strategy as a reference in all of our experiments. This simple space-filling 
strategy is defined by X n +\ — argmax^ mini<j< n \Yj — X;|, where the argmax runs over all indices j 
such that Yj £ {X\, . . . , X n } ■ Note that this strategy does not depend on the choice of a Gaussian 
process model. 

Our first experiment (Figure 7 ) provides a comparison of the four SUR strategies proposed in 
Section 3.2. It appears that all of them perform roughly the same when compared to the reference 
strategy. A closer look, however, reveals that the strategies Jf UR and J| UR provided by Proposition 3 
perform slightly better than the other two (noticeably so in the case d = 3). 

The performance of the tIMSE strategy is shown on Figure 8 for several value of its tuning parameter 
erf (other values, not shown here, have been tried as well). It is clear that the performance of this strategy 
improves when erf goes to zero, whatever the dimension. 
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The performance of the strategy based on J^® is shown on Figure 9 for several values of its parame- 
ters. It appears that the criterion proposed by Bichon et al. (2008), which corresponds to 5 = 1, performs 
better than the one proposed by Ranjan et al. (2008), which corresponds to S = 2, for the same value 
of k. Moreover, the value k — 0.5 has been found in our experiments to produce the best results. 

Figure 10 illustrates that the loss of performance associated to the "pruning trick" introduced in 
Section 3.4 can be negligible if the size mo of the pruned MC sample is large enough (here, mo has been 
taken equal to 50). In practice, the value of mo should be chosen small enough to keep the overhead 
of the sequential strategy reasonable — in other words, large values of mo should only be used for very 
complex computer codes. 

Finally, a comparison involving the best strategy obtained in each category is presented on Figure 11. 
The best result is consistently obtained with the SUR strategy based on Jf^. The tIMSE strategy with 
cr £ ~ provides results which are almost as good. Note that both strategies are one-step lookahead 
strategies based on the approximation of the risk by an integral criterion, which makes them rather 
expensive to compute. Simpler strategies based on the marginal distribution (criteria J^ B and J^ GL ) 
provide interesting alternatives for moderately expensive computer codes: their performances, although 
not as good as those of one-step lookahead criterions, are still much better than that of the reference 
space-filling strategy. 



6 Concluding remarks 

One of the main objectives of this paper was to present a synthetic viewpoint on sequential strategies 
based on a Gaussian process model and kriging for the estimation of a probability of failure. The starting 
point of this presentation is a Bayesian decision-theoretic framework from which the theoretical form 
of an optimal strategy for the estimation of a probability of failure can be derived. Unfortunately, 
the dynamic programming problem corresponding to this strategy is not numerically tractable. It is 
nonetheless possible to derive from there the ingredients of a sub-optimal strategy: the idea is to focus 
on one-step lookahead suboptimal strategies, where the exact risk is replaced by a substitute risk that 
accounts for the information gain about a expected from a new evaluation. We call such a strategy a 
stepwise uncertainty reduction (SUR) strategy. Our numerical experiments show that SUR strategies 
perform better, on average, than the other strategies proposed in the literature. However, this comes at 
a higher computational cost than strategies based only on marginal distributions. The tIMSE sampling 
criterion, which seems to have a convergence rate comparable to that of the SUR criterions when erf w 0, 
also has a high computational complexity. 

In which situations can we say that the sequential strategies presented in this paper are interesting 
alternatives to classical importance sampling methods for estimating a probability of failure, for instance 
the subset sampling method of Au and Beck (2001)? In our opinion, beyond the obvious role of the 
simulation budget N, the answer to this question depends on our capacity to elicit an appropriate prior. 
In the example of Section 5.2, as well as in many other examples of the literature using Gaussian processes 
in the domain of computer experiments, the prior is easy to choose because X is a low-dimensional space 
and / tends to be smooth. Then, the plug-in approach which consists in using ML or REML to estimate 
the parameters of the covariance function of the Gaussian process after each new evaluation is likely 
to succeed. If X is high-dimensional and / is expensive to evaluate, difficulties arise. In particular, our 
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sampling strategies do not take into account our uncertain knowledge of the covariance parameters, and 
there is no guarantee that ML estimation will do well when the points are chosen by a sampling strategy 
that favors some localized target region (the neighboorhood the frontier of the domain of failure in this 
paper, but the question is equally relevant in the field optimization, for instance). The difficult problem 
of deciding the size no of the initial design is crucial in this connection. Fully Bayes procedures constitute 
a possible direction for future research, as long as they don't introduce an unacceptable computational 
overhead. Whatever the route, we feel that the robustness of Gaussian process-based sampling strategies 
with respect to the procedure of estimation of the covariance parameters should be addressed carefully 
in order to make these methods usable in the industrial world. 

Software. We would like to draw the reader's attention on the recently published package Kriglnv (Picheny and 
Ginsbourger, 2011) for the statistical computing environment R (see Hornik, 2010). This package provides an 
open source (GPLv3) implementation of all the strategics proposed in this paper. Please note that the simulation 
results presented in this paper were not obtained using this package, that was not available at the time of its 
writing. 
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Fig. 2 Illustration of a SUR strategy. This figure shows the initial design. Top: threshold u = 1 (horizontal 
dashed line) ; function / (thin line) ; n = 4 initial evaluations (squares) ; kriging approximation f n (thick line) ; 95% 
confidence intervals computed from the kriging variance (shaded area). Middle: probability of excursion (solid 
line); probability density of Px (dotted line). Bottom: graph of Jf^^L^Yi), i = 1, . . . ,m = 1500, the minimum of 
which indicates where the next evaluation of / should be done (i.e., near the origin). 
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Fig. 3 Illustration of a SUR strategy (see also Figures 2 and 4). This figure shows the progress of the SUR 
strategy after two iterations — a total of n = 6 evaluations (squares) have been performed. The next evaluation 
point will be approximately at x = —0.5 
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Fig. 5 Left: mesh plot of the performance function / corresponding to the four-branch series system; a failure 
happens when / is below the transparent plane; Right: contour plot of /; limit state / = (thick line); sample of 
size m = 3 X 10 3 from Px (dots). 
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Fig. 6 The first 16 points (squares) evaluated using sampling criterion Jf VR (left). J tIMSE with <r| =0.1 (middle), 
jtiMSE w j^ n a 2 _ i (right). Numbers near squares indicate the order of evaluation. The location of the no = 10 
points of the initial design are indicated by circles. 
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Appendix 



A The Matern covariance 

The exponential covariance and the Matern covariance are among the most conventionally used stationary co- 
variances in the literature of design and analysis of computer experiments. The Matern covariance class (Yaglom, 
1986) offers the possibility to adjust the regularity of £ with a single parameter. Stein (1999) advocates the use 
of the following paramctrization of the Matern function: 

Ku {h) = 2 ^_ 1 1 r( , : (2v 1 ' 2 hy K v (2i// 2 fe) , h e R (35) 

where r is the Gamma function and K v is the modified Bcssel function of the second kind. The parameter if > 
controls regularity at the origin of the function. To model a real- valued function / defined over X C R d , with 
d > 1, we use the following anisotropic form of the Matern covariance: 



l(x, y) = a 2 K v 



1 = 1 



£ ( * w 7 M)a ), x,y^ 



where denote the i th coordinate of x and y, the positive scalar a 2 is a variance parameter (we have 

kg(x, x) = a 2 ), and the positive scalars pi represent scale or range parameters of the covariance, i.e., characteristic 
correlation lengths. Since a 2 > 0, v > 0,Pi > 0, i = l,...,d, we can take the logarithm of these scalars, 
and consider the vector of parameters 9 = {log a 2 , log i/, — log pi — log p c {\ £ R ti + 2 , which is a practical 
parameterization when a 2 ,v, pi, i = 1 , . . . , d, need to be estimated from data. 



B Proof of Proposition 4 

a) Using the identity <f — 1 (1 — p) = — <P~ 1 (p), we get 

|c/ + #-i(i-p)| = ([/.^-i^i I \u+0- x ( P )\, 

where = denotes an equality in distribution. Therefore G K: s(l — p) = G Kt s(p). 

b) Let S p = max(0, «° — \<P~ 1 (p) + Straightforward computations show that 1 1— > P (|t + U\ < v) is strictly 
decreasing to on [0, +oo[, for all v > 0. As a consequence, p i— > P(5 P < s) is strictly increasing to 1 on [1/2, 1[, 
for all s £ ]0,k i5 [. Therefore, G K $ is strictly decreasing on [1/2, 1[ and tends to zeros when p — > 1. The other 
assertions then follow from a). 

c) Recall that f (a;) ~ J\f(£„(x), a 2 (x)) under P n . Therefore U := (? (x) - £ n (x)) /a n (x) ~ J\f(0, 1) under P n , and 
the result follows by substitution in (31). 



The closed-form expressions of Ranjan et al.'s and Bichon and al.'s criteria (assertions d) and e)) is established 
in the following sections. 
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B.l A preliminary decomposition common to both criteria 
Recall that i = <? _1 (1 -p), t+ = t + K and t~ = t - k. Then, 

G K , S (P) = G KiS (l-p) = E(max(o,/c*- \t - U\ 6 )) 
= f (k 6 -\t- u\ 6 ) <p(u)du 

rt + 

= / U s - \t-u\ 6 \ ip(u)du 

= k s (<2>(t+ )-<f>(t")) - J \t - u| s <p(u) du . (37) 



The computation of the integral A will be carried separately in the next two sections for 5 = 1 and 5 = 2. For 
this purpose, we shall need the following elementary results: 



,6 

/ utp(u)du = tp(a) — <p(b) , 

J a 
rb 

I u 2 ip(u)du 

J a 



J (t — u) 2 ip(u) du 

rt+ rt+ pt+ 

t 2 / <p{u) du — 2t / u(p{u) du + / u 2 tp(u) du 
Jt~ Jt~ Ji- 



lt 

+2 



(38) 



: ay{a) - b<p(b) + 0(b) - <P(a) . (39) 



B.2 Case 8 = 1 

Let us compute the value A\ of the integral A for 5 = 1: 

,t+ ,t ,t + 

Ai = \t — u\ tp(u)du = (t — u)ip(u)du + / (u — t)ip(u)du 

Jt~ Jt~ Jt 

= t f{u) du — J f(u) du^j — J uip(u) du + J wpiu) du 

= t (2*(t) - <P(t- )-<?(*+)) + 2<p(t) - <p(t-) - <p(t+) , (40) 
where (38) has been used to get the final result. Plugging (40) into (37) yields (33). 

B.3 Case 6 = 2 

Let us compute the value A2 of the integral A for 5 = 2: 
A-> 



= t 2 (*(t+) - *(t-)) - 2t (<p(t-) - <p(t+)) 

+ t-<p(t~) - t+ip(t+) + *(t+) - *(t~) , (41) 

where (38) and (39) have been used to get the final result. Plugging (40) into (37) yields (34) . 
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